THUIR at TREC 2005 Terabyte Track
Abstract
This year the IR group of Tsinghua University used its TMiner text retrieval system for indexing and retrieval in the ad hoc and named-page subtasks of the Terabyte track. For both tasks, we built the indices from the in-link anchor texts (the anchor texts of the URLs in the collection that point to the current page) together with the content texts of the web pages. For retrieval, we used the word-pair method [1], which proved effective on the 2004 and 2005 Terabyte ad hoc topics and on the 2005 named-page task. We provide further analysis of the performance of the word-pair method in comparison with the Markov random field term dependence model of [2] and with another generative phrase model that we proposed, which fits more naturally into the language modeling framework [3].

1. TMiner at Terabyte 2005

On a PC with 2GB of memory, one CPU, and IDE hard disks, TMiner could index 50GB of text (about 200GB of HTML files) in tolerable time. However, the terabyte collection contains about 100GB of pure text (110GB including anchor texts), so building a single index for such a large collection would cost TMiner too much time. In our experiments, we therefore built 27 indices, one for each of the 27 parts of the collection. When retrieving, we summed the DF values of the query terms over all indices and assigned the BM2500 RSV to documents in the collection according to the DF sum. This distributed index system returns exactly the same RSV as a single index built over the whole collection would, at the expense of additional query processing time.

In the ad hoc and named-page tasks, we used the index that combines in-link anchor text with page content. In our observation this is the most effective way of using anchor text for retrieval, and we did not build indices without in-link anchor text for comparison. In addition to the use of anchor text, since the indices we built contain full position information for the ...
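The DF-summing scheme described in Section 1 can be illustrated with a minimal sketch. It is not TMiner's code: the shard layout, the function names (`build_shard`, `global_stats`, `bm25_scores`, `search`), and the parameters `k1` and `b` are all hypothetical, and a common Okapi BM25 variant stands in for the BM2500 weighting, whose exact form is not given here. The sketch also assumes that the document count and average document length are aggregated across shards along with the DF values, which a length-normalized score needs in order to match the single-index result.

```python
import math
from collections import Counter


def build_shard(docs):
    """Build a toy shard index: per-document term counts plus per-shard DF."""
    tfs = {doc_id: Counter(text.split()) for doc_id, text in docs.items()}
    df = Counter()
    for counts in tfs.values():
        df.update(counts.keys())  # each document counts once per term
    return {"tfs": tfs, "df": df}


def global_stats(shards, query_terms):
    """Aggregate collection statistics over all shards (DF values are summed)."""
    n_docs = sum(len(s["tfs"]) for s in shards)
    total_len = sum(sum(c.values()) for s in shards for c in s["tfs"].values())
    df = {t: sum(s["df"][t] for s in shards) for t in query_terms}
    return n_docs, total_len / n_docs, df


def bm25_scores(shard, query_terms, n_docs, avg_len, df, k1=1.2, b=0.75):
    """Score one shard's documents using the global statistics.

    Assumption: a standard BM25 form (with a +1 inside the log to keep the
    IDF positive), used here only as a stand-in for BM2500.
    """
    scores = {}
    for doc_id, counts in shard["tfs"].items():
        doc_len = sum(counts.values())
        score = 0.0
        for t in query_terms:
            tf = counts[t]
            if tf == 0 or df[t] == 0:
                continue
            idf = math.log((n_docs - df[t] + 0.5) / (df[t] + 0.5) + 1.0)
            norm = tf + k1 * (1 - b + b * doc_len / avg_len)
            score += idf * tf * (k1 + 1) / norm
        if score > 0:
            scores[doc_id] = score
    return scores


def search(shards, query):
    """Sum DF over shards first, then score every shard and merge the results."""
    terms = query.split()
    n_docs, avg_len, df = global_stats(shards, terms)
    merged = {}
    for shard in shards:
        merged.update(bm25_scores(shard, terms, n_docs, avg_len, df))
    return merged
```

Because every shard is scored with the same global statistics, splitting the collection changes only where the work is done, not the scores. The extra pass that collects the statistics corresponds to the additional query processing time mentioned above.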
Similar resources
Using Normal PC to Index and Retrieval Terabyte Document - THUIR at TREC 2004 Terabyte Track
THUIR at TREC 2008: Relevance Feedback Track
Tsinghua University Information Retrieval Group (THUIR) participated in the first Relevance Feedback Track of TREC 2008. The TMiner search engine was used as our text retrieval system, because its processing capability and flexibility on large text data have been demonstrated over many years of Web Track and Terabyte Track participation. In the track, we studied two approaches: 1) query ...
Improved Feature Selection and Redundance Computing - THUIR at TREC 2004 Novelty Track
This is the third year that Tsinghua University Information Retrieval Group (THUIR) has participated in the Novelty task of TREC. Our research on this year's novelty track mainly focused on four aspects: (1) text feature selection and reduction; (2) improved sentence classification in finding relevant information; (3) efficient sentence redundancy computing; (4) effective result filtering. All experime...
THUIR at TREC 2009 Web Track: Finding Relevant and Diverse Results for Large Scale Web Search
This is the 8th year that the IR group of Tsinghua University (THUIR) has participated in TREC. This year we focus on the Web track, which contains two tasks, namely ad hoc and diversity. On the ad hoc task, we improved the efficiency of our distributed retrieval system TMiner to handle terabytes of Web data. Then three studies were done, namely page quality estimation, ranking feature analysis, and model...
Dublin City University at the TREC 2005 Terabyte Track
For the 2005 Terabyte track in TREC, Dublin City University participated in all three tasks: Adhoc, Efficiency and Named Page Finding. Our runs for TREC in all tasks were primarily focussed on the application of “Top Subset Retrieval” to the Terabyte Track. This retrieval utilises different types of sorted inverted indices so that fewer documents are processed in order to reduce query times, and ...
Publication date: 2005